Advanced Retrieval With LangChain#

Let’s go over a few more complex and advanced retrieval methods with LangChain.

There is no one right way to retrieve data - it depends on your application, so take some time to think about it before you jump in

Let’s have some fun

  • Multi Query - Given a single user query, use an LLM to synthetically generate multiple other queries. Use each one of the new queries to retrieve documents, take the union of those documents for the final context of your prompt

  • Contextual Compression - Fluff remover. Normal retrieval but with an extra step of pulling out relevant information from each returned document. This makes each relevant document smaller for your final prompt (which increases information density)

  • Parent Document Retriever - Split and embed small chunks (for maximum information density), then return the parent documents (or larger chunks) those small chunks come from

  • Ensemble Retriever - Combine multiple retrievers together

  • Self-Query - The retriever infers filters from a user’s query and applies those filters to the underlying data

# Unzip data folder

import zipfile
with zipfile.ZipFile('../../data.zip', 'r') as zip_ref:
    zip_ref.extractall('..')
from dotenv import load_dotenv
import os

load_dotenv()

openai_api_key=os.getenv('OPENAI_API_KEY', 'YourAPIKey')

Load up our texts and documents#

Then chunk them, and put them into a vector store

from langchain.document_loaders import DirectoryLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

We’re going to load up Paul Graham’s essays. In this repo there are folders of various sizes (PaulGrahamEssaysSmall, PaulGrahamEssaysMedium, PaulGrahamEssaysLarge, or PaulGrahamEssays for the full set).

loader = DirectoryLoader('../data/PaulGrahamEssaysLarge/', glob="**/*.txt", show_progress=True)

docs = loader.load()
100%|███████████████████████████████████████████| 49/49 [00:30<00:00,  1.62it/s]
print (f"You have {len(docs)} essays loaded")
You have 49 essays loaded

Then we’ll split up our text into smaller sized chunks

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
splits = text_splitter.split_documents(docs)

print (f"Your {len(docs)} documents have been split into {len(splits)} chunks")
Your 49 documents have been split into 471 chunks
if 'vectordb' in globals(): # If you've already made your vectordb this will delete it so you start fresh
    vectordb.delete_collection()

embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)
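
Before moving on, a quick sanity check that the new vector store is queryable (the query string below is arbitrary, just for illustration):

# Sanity check: run an arbitrary similarity search against the fresh vector store
sample_docs = vectordb.similarity_search("startup advice", k=2)
for doc in sample_docs:
    print(doc.metadata['source'], '-', doc.page_content[:80])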

MultiQuery#

This retrieval method will generate 3 additional questions, for a total of 4 queries (including the user’s original), which will all be used to retrieve documents. This is helpful when you want to retrieve documents which are similar in meaning to your question.

from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.prompts import PromptTemplate
# Set logging for the queries
import logging

Doing some logging to see the other questions that were generated. I tried to find a way to get these via a model property but couldn’t, lmk if you find a way!

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

Then we set up the MultiQueryRetriever which will generate other questions for us

question = "What is the authors view on the early stages of a startup?"
llm = ChatOpenAI(temperature=0)

retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)
unique_docs = retriever_from_llm.get_relevant_documents(query=question)
INFO:langchain.retrievers.multi_query:Generated queries: ['1. How does the author perceive the early stages of a startup?', "2. What are the author's thoughts on the initial phases of a startup?", "3. What is the author's perspective on the beginning stages of a startup?"]

Check out how there are other questions which are related to, but slightly different from, the question I asked.

Let’s see how many docs were actually returned

len(unique_docs)
8
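
The generated queries only show up in the logs above. If you’d rather have them as Python objects, here’s a minimal DIY sketch of the same idea (the generate_queries helper and its prompt are made up for illustration - they’re not part of LangChain):

# Sketch: generate alternative queries ourselves, then retrieve and take the union
def generate_queries(llm, question, n_alternatives=3):
    prompt = (f"Generate {n_alternatives} different versions of the question below, "
              f"one per line, with no numbering.\n\nQuestion: {question}")
    lines = llm.predict(text=prompt).strip().split("\n")
    return [question] + [line.strip() for line in lines if line.strip()]

queries = generate_queries(llm, question)
print(queries)

# Retrieve for each query and dedupe by page content (the "union" step)
seen, manual_union = set(), []
for q in queries:
    for doc in vectordb.similarity_search(q, k=3):
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            manual_union.append(doc)
print(f"{len(manual_union)} unique docs retrieved")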

Ok now let’s put those docs into a prompt template which we’ll use as context

prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
llm.predict(text=PROMPT.format_prompt(
    context=unique_docs,
    question=question
).text)
'The author believes that it is important for startups to release an early version of their product quickly and then improve it based on user feedback. They emphasize the importance of getting version 1 done fast and state that startups that are too slow to release often fail.'
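
One optional tweak: we passed the raw Document objects in as {context}, so the prompt also contains the Document(...) wrappers and metadata. If you only want the text, you can join the page contents first:

# Optional: pass only the text of each retrieved doc, not the full Document reprs
context_text = "\n\n".join(doc.page_content for doc in unique_docs)
llm.predict(text=PROMPT.format(context=context_text, question=question))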

Contextual Compression#

Then we’ll move on to contextual compression. This will take the chunk that you’ve made (above) and compress its information down to the parts relevant to your query.

Say that you have a chunk which covers 3 topics and you only really care about one of them. This compressor will look at your query, see that you only need one of the 3 topics, then extract & return just that one topic.

This one is a bit more expensive because each doc returned will get processed an additional time (to pull out the relevant data)

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

We first need to set up our compressor. It’s cool that it’s a separate object, because that means you can use it elsewhere, outside this retriever, as well.

llm = ChatOpenAI(temperature=0, model='gpt-4')

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,
                                                       base_retriever=vectordb.as_retriever())

First, an example of compression. Below we have one of our splits that we made above

splits[0].page_content
"July 2006I've discovered a handy test for figuring out what you're addicted\n\nto.  Imagine you were going to spend the weekend at a friend's house\n\non a little island off the coast of Maine.  There are no shops on\n\nthe island and you won't be able to leave while you're there.  Also,\n\nyou've never been to this house before, so you can't assume it will\n\nhave more than any house might.What, besides clothes and toiletries, do you make a point of packing?\n\nThat's what you're addicted to.  For example, if you find yourself\n\npacking a bottle of vodka (just in case), you may want to stop and\n\nthink about that.For me the list is four things: books, earplugs, a notebook, and a\n\npen.There are other things I might bring if I thought of it, like music,\n\nor tea, but I can live without them.  I'm not so addicted to caffeine\n\nthat I wouldn't risk the house not having any tea, just for a\n\nweekend.Quiet is another matter.  I realize it seems a bit eccentric to\n\ntake earplugs on a trip to an island off the coast of Maine.  If\n\nanywhere should be quiet, that should.  But what if the person in\n\nthe next room snored?  What if there was a kid playing basketball?\n\n(Thump, thump, thump... thump.)  Why risk it?  Earplugs are small.Sometimes I can think with noise.  If I already have momentum on\n\nsome project, I can work in noisy places.  I can edit an essay or\n\ndebug code in an airport.  But airports are not so bad: most of the\n\nnoise is whitish.  I couldn't work with the sound of a sitcom coming"

Now we are going to pass a question to it, and with that question we will compress the doc. The cool part is this doc will be contextually compressed, meaning the resulting document will only contain the information relevant to the question.

compressor.compress_documents(documents=[splits[0]], query="test for what you like to do")
[Document(page_content="I've discovered a handy test for figuring out what you're addicted to.  Imagine you were going to spend the weekend at a friend's house on a little island off the coast of Maine.  There are no shops on the island and you won't be able to leave while you're there.  Also, you've never been to this house before, so you can't assume it will have more than any house might.What, besides clothes and toiletries, do you make a point of packing? That's what you're addicted to.", metadata={'source': '../data/PaulGrahamEssaysLarge/island.txt'})]

Great, so we started with a long document and now we have a shorter document with more dense information. Great for getting rid of the fluff. Let’s try it out on our essays

question = "What is the authors view on the early stages of a startup?"
compressed_docs = compression_retriever.get_relevant_documents(question)
print (len(compressed_docs))
compressed_docs
4
[Document(page_content='The thing I probably repeat most is this recipe for a startup: get\n\na version 1 out fast, then improve it based on users\' reactions.By "release early" I don\'t mean you should release something full\n\nof bugs, but that you should release something minimal.  Users hate\n\nbugs, but they don\'t seem to mind a minimal version 1, if there\'s\n\nmore coming soon.There are several reasons it pays to get version 1 done fast.  One\n\nis that this is simply the right way to write software, whether for\n\na startup or not.  I\'ve been repeating that since 1993, and I haven\'t seen much since to\n\ncontradict it.  I\'ve seen a lot of startups die because they were\n\ntoo slow to release stuff, and none because they were too quick.', metadata={'source': '../data/PaulGrahamEssaysLarge/startuplessons.txt'}),
 Document(page_content='"Bring us your startups early," said Google\'s speaker at the Startup School.  They\'re quite\n\nexplicit about it: they like to acquire startups at just the point\n\nwhere they would do a Series A round.  (The Series A round is the\n\nfirst round of real VC funding; it usually happens in the first\n\nyear.) It is a brilliant strategy, and one that other big technology\n\ncompanies will no doubt try to duplicate.', metadata={'source': '../data/PaulGrahamEssaysLarge/vcsqueeze.txt'}),
 Document(page_content="Building office buildings for technology companies won't get you a\n\nsilicon valley, because the key stage in the life of a startup\n\nhappens before they want that kind of space.  The key stage is when\n\nthey're three guys operating out of an apartment.  Wherever the\n\nstartup is when it gets funded, it will stay.  The defining quality", metadata={'source': '../data/PaulGrahamEssaysLarge/siliconvalley.txt'}),
 Document(page_content='of a startup, which in its raw form is more a distraction than a motivator.', metadata={'source': '../data/PaulGrahamEssaysLarge/vcsqueeze.txt'})]

We now have 4 docs but they are shorter and only contain the information that is relevant to our query.

Let’s put it in our prompt template again.

prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
llm.predict(text=PROMPT.format_prompt(
    context=compressed_docs,
    question=question
).text)
"The author believes that the early stages of a startup are crucial. He advises startups to release a minimal version 1 of their product quickly and then improve it based on user feedback. He also mentions that many startups fail because they are too slow to release their products. Furthermore, he notes that the key stage in a startup's life often happens when it's still a small operation, possibly operating out of an apartment."

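If you’d rather not format the prompt by hand, the compression retriever also drops straight into a standard chain. A quick sketch using RetrievalQA with its default QA prompt:

# Sketch: plug the compression retriever into a RetrievalQA chain
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=compression_retriever)
qa_chain.run(question)
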
Parent Document Retriever#

LangChain documentation does a great job describing this - my minor edits below:

When you split your docs, you generally want small documents so that their embeddings can most accurately reflect their meaning. If documents are too long, the embeddings can lose meaning.

But at the same time you may want to have information around those small chunks to keep context of the longer document.

The ParentDocumentRetriever strikes that balance by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks but then looks up the parent ids for those chunks and returns those larger documents.

Note that “parent document” refers to the document that a small chunk originated from. This can either be the whole raw document OR a larger chunk.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
# This text splitter is used to create the child documents. They should be small chunk size.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="return_full_documents",
    embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore, 
    docstore=store, 
    child_splitter=child_splitter,
)

Now we will add the whole essays that we loaded above. We haven’t chunked these essays yet, but .add_documents will do the small chunking for us with the child_splitter above

retriever.add_documents(docs, ids=None)

Now if we put in a question or query, we’ll get small chunks returned

sub_docs = vectorstore.similarity_search("what is some investing advice?")
sub_docs
[Document(page_content="people there are rich, or expect to be when their options vest.\n\nOrdinary employees find it very hard to recommend an acquisition;\n\nit's just too annoying to see a bunch of twenty year olds get rich\n\nwhen you're still working for salary.  Even if it's the right thing\n\nfor your company to do.The Solution(s)Bad as things look now, there is a way for VCs to save themselves.", metadata={'doc_id': 'a4372dda-31dc-477f-9239-2ac45d11f3db', 'source': '../data/PaulGrahamEssaysLarge/vcsqueeze.txt'}),
 Document(page_content="the product is expensive to develop or sell, or simply because\n\nthey're wasteful.If you're paying attention, you'll be asking at this point not just\n\nhow to avoid the fatal pinch, but how to avoid being default dead.\n\nThat one is easy: don't hire too fast.  Hiring too fast is by far\n\nthe biggest killer of startups that raise money.", metadata={'doc_id': '0314fd17-e53a-4c6e-9d80-2a721b8800df', 'source': '../data/PaulGrahamEssaysLarge/aord.txt'}),
 Document(page_content="[1]\n\nBut investors are so fickle that you can never\n\ndo more than start to count on them.  Sometimes something about your\n\nbusiness will spook investors even if your growth is great.  So no\n\nmatter how good your growth is, you can never safely treat fundraising\n\nas more than a plan A. You should always have a plan B as well: you\n\nshould know (as in write down) precisely what you'll need to do to", metadata={'doc_id': '0314fd17-e53a-4c6e-9d80-2a721b8800df', 'source': '../data/PaulGrahamEssaysLarge/aord.txt'}),
 Document(page_content="commitment.If an acquirer thinks you're going to stick around no matter what,\n\nthey'll be more likely to buy you, because if they don't and you\n\nstick around, you'll probably grow, your price will go up, and\n\nthey'll be left wishing they'd bought you earlier.  Ditto for\n\ninvestors.  What really motivates investors, even big VCs, is not\n\nthe hope of good returns, but the fear of missing out.\n\n[6]", metadata={'doc_id': 'ad7c0de6-ec3d-4637-9dea-8de2d37fa505', 'source': '../data/PaulGrahamEssaysLarge/startuplessons.txt'})]

Look how small those chunks are. Now we want to get the parent doc which those small docs are a part of.

retrieved_docs = retriever.get_relevant_documents("what is some investing advice?")

I’m going to only do the first doc to save space, but there are more waiting for you. Keep in mind that LangChain will do the union of docs, so if you have two child docs from the same parent doc, you’ll only return the parent doc once, not twice.

retrieved_docs[0].page_content[:1000]
"November 2005In the next few years, venture capital funds will find themselves\n\nsqueezed from four directions.  They're already stuck with a seller's\n\nmarket, because of the huge amounts they raised at the end of the\n\nBubble and still haven't invested.  This by itself is not the end\n\nof the world.  In fact, it's just a more extreme version of the\n\nnorm\n\nin the VC business: too much money chasing too few deals.Unfortunately, those few deals now want less and less money, because\n\nit's getting so cheap to start a startup.  The four causes: open\n\nsource, which makes software free; Moore's law, which makes hardware\n\ngeometrically closer to free; the Web, which makes promotion free\n\nif you're good; and better languages, which make development a lot\n\ncheaper.When we started our startup in 1995, the first three were our biggest\n\nexpenses.  We had to pay $5000 for the Netscape Commerce Server,\n\nthe only software that then supported secure http connections.  We\n\npaid $3000 for a server with a 90"

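To double-check that union behavior, compare how many child chunks came back versus how many distinct parents they map to:

# Several child chunks can share a parent doc_id, but each parent is only returned once
parent_ids = set(doc.metadata['doc_id'] for doc in sub_docs)
print(f"{len(sub_docs)} child chunks -> {len(parent_ids)} unique parents -> {len(retrieved_docs)} docs returned")
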
However, here we got the full document back. Sometimes this will be too long, and we actually just want a larger chunk instead. Let’s do that.

Notice the chunk size difference between the parent splitter and child splitter.

# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="return_split_parent_documents", embedding_function=OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryStore()

This will set up our retriever for us

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore, 
    docstore=store, 
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

Now this time when we add documents, two things will happen:

  1. Larger chunks - We’ll split our docs into large chunks

  2. Smaller chunks - We’ll split our docs into smaller chunks

The smaller chunks get embedded and stored in the vector store, the larger chunks are kept in the doc store, and each small chunk is linked back to the larger chunk it came from.

retriever.add_documents(docs)

Let’s check out how many documents we have now

len(list(store.yield_keys()))
385
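
You can also peek at one of the stored parent chunks directly from the docstore. With the parent_splitter set, these should be roughly 2,000-character chunks rather than whole essays:

# Peek at one stored parent chunk from the docstore
first_key = next(store.yield_keys())
parent_doc = store.mget([first_key])[0]
print(len(parent_doc.page_content))
print(parent_doc.page_content[:300])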

Then let’s go get our small chunks to make sure it’s working and see how long they are

sub_docs = vectorstore.similarity_search("what is some investing advice?")
sub_docs
[Document(page_content="people there are rich, or expect to be when their options vest.\n\nOrdinary employees find it very hard to recommend an acquisition;\n\nit's just too annoying to see a bunch of twenty year olds get rich\n\nwhen you're still working for salary.  Even if it's the right thing\n\nfor your company to do.The Solution(s)Bad as things look now, there is a way for VCs to save themselves.", metadata={'doc_id': 'd7cfabc8-b712-4fd1-8f1c-917c66cfdb68', 'source': '../data/PaulGrahamEssaysLarge/vcsqueeze.txt'}),
 Document(page_content="commitment.If an acquirer thinks you're going to stick around no matter what,\n\nthey'll be more likely to buy you, because if they don't and you\n\nstick around, you'll probably grow, your price will go up, and\n\nthey'll be left wishing they'd bought you earlier.  Ditto for\n\ninvestors.  What really motivates investors, even big VCs, is not\n\nthe hope of good returns, but the fear of missing out.\n\n[6]", metadata={'doc_id': 'f6df5dfa-47e4-46be-9e77-8303a512b8c1', 'source': '../data/PaulGrahamEssaysLarge/startuplessons.txt'}),
 Document(page_content="April 2006(This essay is derived from a talk at the 2006\n\nStartup School.)The startups we've funded so far are pretty quick, but they seem\n\nquicker to learn some lessons than others.  I think it's because\n\nsome things about startups are kind of counterintuitive.We've now\n\ninvested\n\nin enough companies that I've learned a trick\n\nfor determining which points are the counterintuitive ones:", metadata={'doc_id': 'd9e056c8-fee1-43e7-bbb1-8c3c435d663c', 'source': '../data/PaulGrahamEssaysLarge/startuplessons.txt'}),
 Document(page_content="the title of this essay, you already know most of what you need to\n\nknow about M&A in the first year.Notes[1]\n\nI'm not saying you should never sell.  I'm saying you should\n\nbe clear in your own mind about whether you want to sell or not,\n\nand not be led by manipulation or wishful thinking into trying to\n\nsell earlier than you otherwise would have.[2]", metadata={'doc_id': '88d1d80d-dbb7-4c10-bfee-48df7f551452', 'source': '../data/PaulGrahamEssaysLarge/corpdev.txt'})]

Now let’s do the full process: we’ll see what small chunks are generated, but then return the larger chunks as our relevant documents

larger_chunk_relevant_docs = retriever.get_relevant_documents("what is some investing advice?")
larger_chunk_relevant_docs[0]
Document(page_content='means VCs are now in the business of finding promising little 2-3\n\nman startups and pumping them up into companies that cost $100\n\nmillion to acquire.   They didn\'t mean to be in this business; it\'s\n\njust what their business has evolved into.Hence the fourth problem: the acquirers have begun to realize they\n\ncan buy wholesale.  Why should they wait for VCs to make the startups\n\nthey want more expensive?  Most of what the VCs add, acquirers don\'t\n\nwant anyway.  The acquirers already have brand recognition and HR\n\ndepartments.  What they really want is the software and the developers,\n\nand that\'s what the startup is in the early phase: concentrated\n\nsoftware and developers.Google, typically, seems to have been the first to figure this out.\n\n"Bring us your startups early," said Google\'s speaker at the Startup School.  They\'re quite\n\nexplicit about it: they like to acquire startups at just the point\n\nwhere they would do a Series A round.  (The Series A round is the\n\nfirst round of real VC funding; it usually happens in the first\n\nyear.) It is a brilliant strategy, and one that other big technology\n\ncompanies will no doubt try to duplicate.  Unless they want to have\n\nstill more of their lunch eaten by Google.Of course, Google has an advantage in buying startups: a lot of the\n\npeople there are rich, or expect to be when their options vest.\n\nOrdinary employees find it very hard to recommend an acquisition;\n\nit\'s just too annoying to see a bunch of twenty year olds get rich\n\nwhen you\'re still working for salary.  Even if it\'s the right thing\n\nfor your company to do.The Solution(s)Bad as things look now, there is a way for VCs to save themselves.\n\nThey need to do two things, one of which won\'t surprise them, and\n\nanother that will seem an anathema.Let\'s start with the obvious one: lobby to get Sarbanes-Oxley\n\nloosened.  This law was created to prevent future Enrons, not to\n\ndestroy the IPO market.  Since the IPO market was practically dead', metadata={'source': '../data/PaulGrahamEssaysLarge/vcsqueeze.txt'})
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

question = "what is some investing advice?"

llm.predict(text=PROMPT.format_prompt(
    context=larger_chunk_relevant_docs,
    question=question
).text)
"One piece of investing advice is to release a minimal version 1 of a product quickly, then improve it based on users' reactions. This is because it's dangerous to guess what users will like without knowing them. Another advice is for Venture Capitalists to lobby to get Sarbanes-Oxley loosened, as this law was not created to destroy the IPO market. Additionally, it's advised not to sell a startup too early, but to be clear about whether you want to sell or not, and not be led by manipulation or wishful thinking."

Ensemble Retriever#

The next one on our list combines multiple retrievers together. The goal here is to see what multiple methods return, then pull them together for (hopefully) better results.

You may need to install bm25 with !pip install rank_bm25

from langchain.retrievers import BM25Retriever, EnsembleRetriever

We’ll use a BM25 retriever for this one which is really good at keyword matching (vs semantic). When you combine this method with regular semantic search it’s known as hybrid search.

# initialize the bm25 retriever and chroma retriever
bm25_retriever = BM25Retriever.from_documents(splits)
bm25_retriever.k = 2
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(splits, embedding)
vectordb = vectordb.as_retriever(search_kwargs={"k": 2})
# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, vectordb], weights=[0.5, 0.5])
ensemble_docs = ensemble_retriever.get_relevant_documents("what is some investing advice?")
len(ensemble_docs)
3
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

question = "what is some investing advice?"

llm.predict(text=PROMPT.format_prompt(
    context=ensemble_docs,
    question=question
).text)
"One piece of investing advice is to make a larger number of smaller investments instead of a handful of giant ones. It is also suggested to fund younger, more technical founders instead of MBAs and let the founders remain as CEO. Another advice is that the best sources of seed funding are successful startup founders, as they can also provide valuable advice. However, it's important to be aware of the changing nature of the world and industries, as what may seem like a bad idea initially could become a good one due to changes in the world."

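The weights are worth experimenting with. For example, if exact keyword matches matter more for your application, you could skew the ensemble toward BM25 (the 0.75/0.25 split below is arbitrary):

# Skew the ensemble toward keyword (BM25) matches; the weights here are arbitrary
keyword_heavy_retriever = EnsembleRetriever(retrievers=[bm25_retriever, vectordb], weights=[0.75, 0.25])
keyword_heavy_docs = keyword_heavy_retriever.get_relevant_documents(question)
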
Self Querying#

The last one we’ll look at today is self-querying. This is when the retriever has the ability to query itself. It does this so it can use filters when doing its final query.

This means it’ll use the user’s query for semantic search, but also its own query for filtering (so the user doesn’t have to give a structured filter).

You may need to install lark with !pip install lark

from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

embeddings = OpenAIEmbeddings()
llm = ChatOpenAI(temperature=0, model='gpt-4')
if 'vectorstore' in globals(): # If you've already made your vectordb this will delete it so you start fresh
    vectorstore.delete_collection()

vectorstore = Chroma.from_documents(
    splits, embeddings
)

Below is the information on the filters available. This will help the model know which filters it can apply to the underlying data

metadata_field_info=[
    AttributeInfo(
        name="source",
        description="The filename of the essay", 
        type="string or list[string]", 
    ),
]
document_content_description = "Essays from Paul Graham"
retriever = SelfQueryRetriever.from_llm(llm,
                                        vectorstore,
                                        document_content_description,
                                        metadata_field_info,
                                        verbose=True,
                                        enable_limit=True)
retriever.get_relevant_documents("Return only 1 essay. What is one thing you can do to figure out what you like to do from source '../data/PaulGrahamEssaysLarge/island.txt'")
query='figure out what you like to do' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='../data/PaulGrahamEssaysLarge/island.txt') limit=1
[Document(page_content="if I could only figure out what.As for books, I know the house would probably have something to\n\nread.  On the average trip I bring four books and only read one of\n\nthem, because I find new books to read en route.  Really bringing\n\nbooks is insurance.I realize this dependence on books is not entirely good—that what\n\nI need them for is distraction.  The books I bring on trips are\n\noften quite virtuous, the sort of stuff that might be assigned\n\nreading in a college class.  But I know my motives aren't virtuous.\n\nI bring books because if the world gets boring I need to be able\n\nto slip into another distilled by some writer.  It's like eating\n\njam when you know you should be eating fruit.There is a point where I'll do without books.  I was walking in\n\nsome steep mountains once, and decided I'd rather just think, if I\n\nwas bored, rather than carry a single unnecessary ounce.  It wasn't\n\nso bad.  I found I could entertain myself by having ideas instead\n\nof reading other people's.  If you stop eating jam, fruit starts\n\nto taste better.So maybe I'll try not bringing books on some future trip.  They're\n\ngoing to have to pry the plugs out of my cold, dead ears, however.", metadata={'source': '../data/PaulGrahamEssaysLarge/island.txt'})]

It’s kind of annoying to have to put in the full file name - a user doesn’t want to do that. Let’s add an essay metadata field that holds just the essay name instead of the full file path.

import re

for split in splits:
    split.metadata['essay'] = re.search(r'[^/]+(?=\.\w+$)', split.metadata['source']).group()
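
Quick check that the new field is populated - each split should now carry both source (the full path) and essay (just the filename without its extension):

# Each split should now have both 'source' and 'essay' metadata
splits[0].metadata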

Ok now that we did that, let’s make a new field info config

metadata_field_info=[
    AttributeInfo(
        name="essay",
        description="The name of the essay", 
        type="string or list[string]", 
    ),
]
if 'vectorstore' in globals(): # If you've already made your vectordb this will delete it so you start fresh
    vectorstore.delete_collection()

vectorstore = Chroma.from_documents(
    splits, embeddings
)
document_content_description = "Essays from Paul Graham"
retriever = SelfQueryRetriever.from_llm(llm,
                                        vectorstore,
                                        document_content_description,
                                        metadata_field_info,
                                        verbose=True,
                                        enable_limit=True)
retriever.get_relevant_documents("Tell me about investment advice the 'worked' essay? return only 1")
query='investment advice' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='essay', value='worked') limit=1
[Document(page_content='should make a larger number of smaller investments instead of a\n\nhandful of giant ones, they should be funding younger, more technical\n\nfounders instead of MBAs, they should let the founders remain as\n\nCEO, and so on.One of my tricks for writing essays had always been to give talks.\n\nThe prospect of having to stand up in front of a group of people\n\nand tell them something that won\'t waste their time is a great\n\nspur to the imagination. When the Harvard Computer Society, the\n\nundergrad computer club, asked me to give a talk, I decided I would\n\ntell them how to start a startup. Maybe they\'d be able to avoid the\n\nworst of the mistakes we\'d made.So I gave this talk, in the course of which I told them that the\n\nbest sources of seed funding were successful startup founders,\n\nbecause then they\'d be sources of advice too. Whereupon it seemed\n\nthey were all looking expectantly at me. Horrified at the prospect\n\nof having my inbox flooded by business plans (if I\'d only known),\n\nI blurted out "But not me!" and went on with the talk. But afterward\n\nit occurred to me that I should really stop procrastinating about\n\nangel investing. I\'d been meaning to since Yahoo bought us, and now\n\nit was 7 years later and I still hadn\'t done one angel investment.Meanwhile I had been scheming with Robert and Trevor about projects\n\nwe could work on together. I missed working with them, and it seemed', metadata={'essay': 'worked', 'source': '../data/PaulGrahamEssaysLarge/worked.txt'})]

Awesome! It returned it for us. It’s a bit rigid because you need to put in the exact name of the file/essay you want to get. You could add a pre-step to infer the correct essay from the user’s query, but that is out of scope for now and application specific.